In an optimal nonbipartite match, a single population is divided
into matched pairs to minimize a total distance within matched pairs.
Nonbipartite matching has been used to strengthen instrumental variables
in observational studies of treatment effects, essentially by forming
pairs that are similar in terms of covariates but very different
in the strength of encouragement to accept the treatment. Optimal
nonbipartite matching is typically done using network optimization
techniques that can be quick, running in polynomial time, but these
techniques limit the tools available for matching. Instead, we use integer
programming techniques, thereby obtaining a wealth of new tools
not previously available for nonbipartite matching, including fine and
near-fine balance for several nominal variables, forced near balance
on means and optimal subsetting. We illustrate the methods in our
on-going study of outcomes of late-preterm births in California, that
is, births of 34 to 36 weeks of gestation. Would lengthening the time
in the hospital for such births reduce the frequency of rapid readmissions?
A straightforward comparison of babies who stay for a shorter
or longer time would be severely biased, because the principal reason
for a long stay is some serious health problem. We need an instrument,
something inconsequential and haphazard that encourages a
shorter or a longer stay in the hospital. It turns out that babies born
at certain times of day tend to stay overnight once with a shorter
length of stay, whereas babies born at other times of day tend to stay
overnight twice with a longer length of stay, and there is nothing
particularly special about a baby who is born at 11:00 pm. Therefore, we
use hour-of-birth as an instrument for a longer hospital stay. Using
integer programming, we form 80,600 pairs of two babies who are
similar in terms of observed covariates but very different in anticipated
lengths of stay based on their hours of birth. We ask whether
encouragement to stay an extra day reduces readmissions within two
days of discharge. A sensitivity analysis addresses the possibility that
the instrument is not valid as an instrument, that is, not random but
rather biased by an unmeasured covariate associated with the hour of
birth. Bias can give the impression of a treatment effect when there
is no effect, but it can also mask an actual effect, leaving the impression
of no effect, and both possibilities are examined in analyses for
effects and for near equivalence.

1. Introduction: Structure, application, data, outline. 

1.1. The effects of changing the norms for treatment. There are settings, 
common in medicine, clinical psychology and criminology, in which certain 
norms govern the treatment assigned to an individual and yet also a recog- 
nition that unique circumstances may justify a deviation from the norm. In 
such a context, we might ask about the effects of changing the norm without 
changing the latitude to deviate from the norm when circumstances warrant 
a deviation. How should one study a situation such as this? 

In the current paper we look at late preterm births of 34 to 36 weeks ges- 
tation in California and ask whether a shift in the norm for length of stay 
in the hospital nursery reduces the frequency of rapid readmission. Late 
preterm babies typically stay in the nursery for a day or two before being 
discharged from the hospital. Should the norm be one day or two days? 
Perhaps a two-day norm reduces the frequency of rapid readmission, or per- 
haps one day is sufficient and the second day is an unnecessary expense. 
Obviously, a baby with serious health problems will and should be kept in 
the hospital as long as is necessary — no one doubts the need to permit de- 
viations from the norm — and shifting the norm for a comparatively healthy 
baby is not intended to alter the special care required by sick babies. We 
would like to compare similar babies subject to different norms — one day or 
two days — but with the same latitude to ignore the norm in specific cases. 
A straightforward comparison of babies who stay many days versus babies 
who stay a single day will inevitably be a comparison of sick and healthy 
babies and will provide no useful information about changing the norm for 
healthy babies. Goyal, Fager and Lorch (2011) describe changes over time in 
the norms for discharge of late preterm babies and suggest that an evaluation 
of the effects of these changes is needed. 

The question just raised — the question about changing the norm for treat- 
ment while granting the same latitude for deviations from the norm — is 
related to the so-called encouragement design [Holland (1988)]; however, 
it asks a different question than is commonly asked in that design. In a
randomized encouragement experiment, some people are picked at random
and encouraged to take the treatment, while the rest are not encouraged; 
however, there is noncompliance and people often do not do what they are 
encouraged to do. Typically, in an encouragement experiment, the goal is to 
estimate the effect of taking the treatment, not the effect of being encour- 
aged to take it, and noncompliance is a nuisance whose consequences are to 
be removed analytically. In the case of changing norms for treatment, devia- 
tions from the norm are not properly called noncompliance, may be entirely 
appropriate, even necessary, and we may have no interest in estimating what 
would happen in a world that forbade deviations. No one wants to discharge
a sick baby who needs services provided by the hospital, whatever norms are 
adopted for the length of stay of comparatively healthy babies. How would 
outcomes change if the norms changed with no change in the freedom to 
deviate from the norm? Notice that a change in the norm might lead to a 
change in the way the freedom to deviate from the norm is employed. Possi- 
bly, if the norm shifted from two days to one day, more babies would deviate 
from the new one-day norm staying instead the two days they would have 
stayed under the old two-day norm. 

In the case of norms, we are interested in the effects of changing the 
encouragement without removing deviations from what is encouraged. In 
the slightly specialized technical terminology introduced by Angrist, Imbens 
and Rubin (1996), we are interested in the causal effect of encouragement on 
all babies, not its effect on compliers, that is, the estimand of the numerator
of the Wald estimator, not the estimand of the Wald estimator itself. 
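
For reference, the distinction can be written out for a binary instrument. The symbols below are assumed notation, not taken from the text above: Z indicates encouragement, D the treatment actually received, and Y the outcome.

```latex
\[
\text{Wald ratio} \;=\;
\frac{E(Y \mid Z = 1) - E(Y \mid Z = 0)}
     {E(D \mid Z = 1) - E(D \mid Z = 0)} .
\]
```

The numerator alone is the effect of encouragement on the outcome, averaged over everyone who is encouraged or not, which is the quantity of interest when studying a change in norms. Under the assumptions of Angrist, Imbens and Rubin (1996), the full ratio instead estimates the average effect of the treatment among compliers.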

1.2. Is a longer stay in the hospital nursery of benefit to a newborn baby? 
The clock, the hour of birth, may alter whether a newborn baby stays in the 
hospital nursery for one day or two before discharge to face the world for the 
first time. In California, the typical baby born at 3:00 in the afternoon (i.e., 
at 15:00) is discharged the following day, with a median length of stay of 22 
hours, while the typical baby born three hours later at 6:00 in the evening 
(i.e., 18:00) is discharged after two days, with a median length of stay of 
43 hours. To the extent that the hour of birth is itself inconsequential, to 
the extent that the hour of birth tells you nothing about the health of the 
baby, it serves as an instrument, creating variation in length of stay that 
will predict subsequent health outcomes only to the extent that an extra 
day in the nursery is beneficial or harmful. See Angrist, Imbens and Rubin 
(1996) for a general discussion of the use of instrumental variables in causal 
inference. 

An instrument is needed here because a straightforward comparison of 
babies discharged earlier and those discharged much later is likely to be 
severely biased. A baby whose discharge is delayed for several days is very
likely to have significant complications requiring prolonged care or observation,
whereas a baby born at 6:00 in the evening is not an unusual baby.
Although biases are always conceivable in observational studies, there is no
compelling reason to anticipate severe biases in a comparison of babies born
at 3:00 in the afternoon and others born at 6:00 in the evening.

Briefly then, our plan is to form two subsets of babies using just the hour 
of birth, those babies born at times that typically yield a one-day stay and 
those born at times that typically yield a two-day stay. More precisely, we 
use hour of birth to produce pairs of babies with very different anticipated 
lengths of stay (ALOS) based on hour of birth, specifically based on the 
median length of stay for babies born at that hour. In other words, we wish 
to focus attention on an innocuous source of variation in length of stay, the 
hour of birth. Admittedly, our two groups do not always stay one or two 
days, so our groups have heterogeneous lengths of stay; however, unlike the 
hour of birth, variations in length of stay that reflect the health of the baby 
are likely to bias comparisons of other outcomes such as 2-day readmissions, 
and we do not want to use that portion of the variation in length of stay in 
defining our comparison groups. See Malkin, Broder and Keeler (2000) and 
Almond and Doyle (2008) for related tactics. 

An instrument is weak if it barely affects which treatment a baby receives 
and it is strong if it is typically decisive in determining the treatment. Weak 
instruments present substantial problems in part because they contain little 
information [Bound, Jaeger and Baker (1995)] and in part because the in- 
formation they do contain is sensitive to tiny unmeasured biases [Small and 
Rosenbaum (2008)]. Following the theory in Small and Rosenbaum (2008) 
and extending the technique in Baiocchi et al. (2010), we strengthen the in- 
strument by not using all of the babies, forcing the remaining paired babies 
to be further apart in terms of ALOS. Because the strength of an instrument 
affects its design sensitivity, discarding some babies to increase strength can 
increase the power of a sensitivity analysis [Small and Rosenbaum (2008)] 
despite the contrary intuition that we all have from unbiased randomized 
experiments where discarding observations can only reduce power. 

The matching technique we use is a substantial advance over previous 
techniques for this problem and more generally for so-called nonbipartite 
matching problems. We use general integer programming techniques rather 
than the subset of network optimization techniques. As reviewed in Sec- 
tion 3.1, general integer programming techniques are much more flexible in 
what they can do, but in a certain abstract sense they are not as suitable for 
large problems as are network optimization techniques. Despite this abstract 
concern, we did not have difficulty in California pairing 161,200 babies us- 
ing integer programming, although the abstract concern may be relevant in 
other practical contexts.

1.3. Data: Late preterm births in California, 1993-2005. We used statewide
discharge data on birth hospitalizations in California from 1993 to
2005 obtained from the California Office of Statewide Health Planning and 
Development. For each baby, there is a UB-92 form describing principal 
diagnoses and medical procedures. These data were linked to birth certificate 
data, maternal hospital records and hospital admissions up to one year after 
delivery. The data included live-born newborns delivered vaginally at late 
preterm (34-36 weeks) gestation who were discharged home. Using ICD-9- 
CM codes, we excluded newborns likely to require neonatal intensive care 
because of major congenital anomalies, surgeries or complications such as 
respiratory distress syndrome or sepsis. The clinical team excluded newborns 
with length of stay > 5 days, on the grounds that prolonged hospitalization 
likely reflects significant complications and possible neonatal intensive care. 

1.4. Outline: A match, a matching algorithm, an analysis. Section 2 de- 
scribes the matched comparison while Section 3 discusses the optimization 
techniques used to create the matched pairs. The optimization uses integer 
programming in a new way on a large scale. An analysis of one key outcome, 
readmission within two days of discharge, is presented in Section 4. The 
analysis tests null hypotheses of both difference and near-equivalence and 
examines their sensitivity to bias from unmeasured covariates [Rosenbaum 
and Silber (2009a)]. For instance, the analysis asks whether an apparent ab- 
sence of effect might be an effect of substantial magnitude masked by biases 
from unmeasured covariates. 

The manuscript presents an application, from conception through design 
to analysis, but the novel methodological aspects are most prominent in the 
construction of the matched pairs in Section 3. These novel elements are 
easier to describe once the match has been presented in Section 2 and the 
distinction between network and integer optimization has been reviewed in 
Section 3.1. The babies did not arrive as treated or control babies; rather, the 
algorithm split one population of babies into pairs so they have very different 
anticipated lengths of stay based on the hour of birth; that is, in the tech- 
nical terminology of optimization theory, this is a nonbipartite match; for 
example, see Edmonds (1965), Derigs (1988) and Korte and Vygen (2008), 
Section 11. Nonbipartite matching has a variety of uses in statistics [Lu et al. 
(2011)], for instance, matching for time-dependent covariates [Lu (2005), Sil- 
ber et al. (2009)] and strengthening instrumental variables [Baiocchi et al. 
(2010)]. Concisely, if perhaps for the moment obscurely, the novel elements 
of the integer programming algorithm in Section 3 include the following: 
(i) the extension of fine balance to nonbipartite matching, including fine 
balance for several variables at once, something that is not possible with 
network optimization, (ii) the extension of optimal subset matching to non- 
bipartite matching, (iii) the simultaneous use of fine balance and optimal 



6 J. R. ZUBIZARRETA ET AL. 

subset matching in nonbipartite matching, (iv) forcing balance on means in 
nonbipartite matching. For a recent survey of the literature on matching in
observational studies, see Stuart (2010). 

2. The matched comparison: Similar covariates, different anticipated lengths 
of stay based on the hour of birth. For each hour of birth, 0 to 23, we computed
the median length of stay in the hospital. For instance, the median
lengths of stay for babies born at midnight, 11 am and 6 pm were, respec- 
tively, 37 hours, 26 hours and 43 hours. Call this median length of stay for a 
given birth hour the "anticipated length of stay" or ALOS. We formed 80,600 
matched pairs of two similar babies so that one baby in a pair had a much 
longer anticipated length of stay than the other — at least 12 hours, and on 
average about 14 hours. Notice that these two groups of babies are defined 
by their individual hours-of-birth, not their individual lengths of stay. We 
refer to these paired babies as the "long-hour-of-birth" baby and the "short- 
hour-of-birth" baby and abbreviate hour-of-birth as HOB. For instance, a 
baby born at 6 pm might be paired with a baby born at 11 am, where the 
former would be the long-HOB baby and the latter the short-HOB baby. The 
new algorithm we used for this matching is described in detail in Section 3, 
but let us first look at the resulting match, then consider its construction. 
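
The definition of ALOS above amounts to a grouped median. The following is a minimal sketch under stated assumptions: the function name `anticipated_los`, the record layout and the toy numbers are all hypothetical, and only the 12-hour minimum gap comes from the text.

```python
from collections import defaultdict
from statistics import median


def anticipated_los(records):
    """Map each birth hour (0-23) to the median length of stay in hours."""
    by_hour = defaultdict(list)
    for hour_of_birth, los_hours in records:
        by_hour[hour_of_birth].append(los_hours)
    return {hour: median(stays) for hour, stays in by_hour.items()}


# Hypothetical toy records: (hour of birth, actual length of stay in hours).
toy_records = [(11, 24), (11, 26), (11, 30), (18, 40), (18, 43), (18, 70)]
alos = anticipated_los(toy_records)

# Gap in anticipated stay for a candidate pair (long-HOB minus short-HOB).
gap = alos[18] - alos[11]
eligible = gap >= 12  # the text requires a gap of at least 12 hours
```

Because the median is used, the single 70-hour stay in the toy data does not move the anticipated stay for babies born at 6 pm.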

The two babies in each pair were both born in the same year in the same 
hospital, that is, the individual pairs were exactly matched for year and hos- 
pital. Table 1 shows the frequencies for the 13 years, and of course these are 

Table 1
Babies were matched exactly for 13 years of birth and for 311 hospitals, and
this table displays the counts of babies by year. Similar tables, not shown,
for 311 hospitals and 13 × 311 years-by-hospitals also exhibit perfect balance

Year of birth, matched exactly    Long-HOB    Short-HOB
1993                                  7471         7471
1994                                  7514         7514
1995                                  7221         7221
1996                                  6877         6877
1997                                  6644         6644
1998                                  6191         6191
1999                                  5814         5814
2000                                  5702         5702
2001                                  5505         5505
2002                                  5348         5348
2003                                  5547         5547
2004                                  5416         5416
2005                                  5350         5350

Table 2
Balance for covariates that were either exactly matched or finely balanced.
The table counts babies, and the total count in each subtable is 161,200 babies

                                  Long-HOB    Short-HOB
Birth weight < 2500 grams, finely balanced
  ≥2500 grams                       72,500       72,500
  <2500 grams                         8100         8100
Gestational age, finely balanced
  34 weeks                          11,133       11,133
  35 weeks                          22,756       22,756
  36 weeks                          46,711       46,711
Gender, finely balanced
  Male                              42,549       42,549
  Female                            38,051       38,051
Race, finely balanced
  Hispanic                          40,342       40,342
  White                             24,067       24,067
  Asian                               7871         7871
  Black                               6009         6009
  Other                               2311         2311
Health insurance, finely balanced
  Federal                           42,061       42,061
  HMO                               31,461       31,461
  Fee for service                     3645         3645
  Uninsured                           2880         2880
  Other                                547          547
  Missing                                6            6
Parity, uniparous versus multiparous, finely balanced
  Multiparous                       50,145       50,145
  Uniparous                         30,455       30,455
Multiple birth, finely balanced
  Single birth                      78,837       78,837
  Multiple birth                      1763         1763

exactly the same for the short-HOB and long-HOB babies. There is a similar 
exactly balanced table, not shown, for the 311 hospitals, and a much larger 
exactly balanced table, also not shown, for the interaction of year and hospital
with 13 × 311 = 4043 categories. Table 2 shows that the marginal distributions
of seven other nominal variables were exactly balanced, specifically
birth weight < 2500 grams, gestational age, gender, race, health insurance, 
parity of the mother, and single or multiple birth. (Because multiple births 
were very rare, we make no special allowance for them.) Indeed, the exact

Table 3
Instrument imbalance and covariate balance in 80,600 matched pairs of two
babies, one born at a long hour-of-birth (HOB), the other born at a short
hour-of-birth. The matching is intended to construct pairs in which the
anticipated length of stay (LOS) based on the babies' hour of birth is quite
different, but covariates, such as birth weight, are similar. Tabulated values
are means. Covariates are binary indicators except as noted

Variable                          Long-HOB    Short-HOB
Instrument
  Anticipated LOS (hours)            39.56        25.48
Covariates
  Birth weight (grams)             3064.96      3065.04
  High school degree                  0.60         0.60
  Birth injury                        0.01         0.01
  Oligohydramnios                     0.01         0.01
  Cord abnormality                    0.04         0.03
  Disorders of the placenta           0.01         0.01
  Eclampsia                           0.00         0.00
  Chorioamnionitis                    0.02         0.01
  Fever post-partum                   0.00         0.00
  Gestational diabetes                0.03         0.03
  Diabetes mellitus                   0.00         0.00
  Prom                                0.09         0.07



balance seen in Table 2 is found within each hospital, that is, within each of 
the 311 categories. Unlike Table 1, Table 2 exhibits fine balance, not exact 
pair matching; that is, the marginal distributions seen in Table 2 are exactly 
the same, but within a single pair the two babies may differ [Rosenbaum, 
Ross and Silber (2007)]. However, we tried to pair individually similar ba- 
bies whenever possible [Zubizarreta et al. (2011)]. Balance on several other 
covariates is displayed in Table 3. 
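
The difference between fine balance and exact pair matching can be made concrete with a small check. This is an illustrative sketch only: the helper names, the dictionary layout of a baby record and the three toy pairs are hypothetical and are not the study's data or code.

```python
from collections import Counter


def is_finely_balanced(pairs, key):
    """True if the two groups have identical marginal counts of `key`."""
    long_counts = Counter(long_baby[key] for long_baby, _ in pairs)
    short_counts = Counter(short_baby[key] for _, short_baby in pairs)
    return long_counts == short_counts


def fraction_exactly_matched(pairs, key):
    """Share of pairs whose two babies agree on `key`."""
    agree = sum(1 for long_baby, short_baby in pairs
                if long_baby[key] == short_baby[key])
    return agree / len(pairs)


# Hypothetical toy match: each entry is (long-HOB baby, short-HOB baby).
toy_pairs = [
    ({"sex": "M"}, {"sex": "F"}),
    ({"sex": "F"}, {"sex": "M"}),
    ({"sex": "M"}, {"sex": "M"}),
]
balanced = is_finely_balanced(toy_pairs, "sex")
exact_share = fraction_exactly_matched(toy_pairs, "sex")
```

In the toy match both groups contain two males and one female, so the margins agree, yet only one of the three pairs is matched exactly on sex. That is the pattern Table 2 permits and Table 1 rules out.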

Birth weight is the most important prognostic variable that is relevant 
to all babies. For this reason, we matched for birth weight in four ways 
that are described in detail in Section 3. Table 2 shows that the marginal 
distribution of low birth weight < 2500 grams is exactly balanced; this is 
a consequence of a fine balance constraint [specifically (2) in Section 3]. 
Also, Table 3 shows the mean birth weights are reasonably close in the long 
and short HOB groups (3064.96 grams for long-HOB and 3065.04 grams 
for short-HOB); this is a consequence of an approximate mean constraint 
[specifically (4) in Section 3]. The algorithm restricted the number of babies 
mismatched for low birth weight [using (3) in Section 3] so that 97% of 
pairs were individually matched for low birth weight; see Table 4. Finally,

Table 4
Matching for low birth weight < 2500 grams. The marginal distributions are
identical, as required by fine balance, and 97% of pairs are on the diagonal,
exactly matched for birth weight < 2500 grams. The table counts pairs, not babies

                         Short-HOB baby
Long-HOB baby      <2500 grams    ≥2500 grams      Total
<2500 grams               6909           1191       8100
≥2500 grams               1191         71,309     72,500
Total                     8100         72,500     80,600



an effort was made to pair individual babies with similar birth weights: the 
median absolute difference in weight for paired babies was 49 grams, and 
the upper quartile was 100 grams. The pairing of babies with similar birth 
weights used a robust Mahalanobis distance that included birth weight as 
one of the variables. 
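
The 97% figure and the equality of the margins can be rechecked from the counts in Table 4. The snippet below is plain arithmetic on the published counts; the dictionary layout and the labels "low" and "normal" are just a convenient, assumed encoding.

```python
# Counts of pairs from Table 4, keyed by (long-HOB baby, short-HOB baby).
table4 = {
    ("low", "low"): 6909, ("low", "normal"): 1191,
    ("normal", "low"): 1191, ("normal", "normal"): 71309,
}

total_pairs = sum(table4.values())

# Marginal counts of low-birth-weight babies in each group.
long_low = table4[("low", "low")] + table4[("low", "normal")]
short_low = table4[("low", "low")] + table4[("normal", "low")]

# Share of pairs on the diagonal, that is, matched exactly on the indicator.
diagonal_share = (table4[("low", "low")] + table4[("normal", "normal")]) / total_pairs
```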

We wanted the long-HOB baby and short-HOB baby to have very different 
anticipated lengths of stay based on their hours of birth. The matching 
algorithm began with all of the babies, splitting them into long and short 
in an optimal manner while selecting an optimal subset to discard. Table 3 
shows that the average anticipated length of stay was 39.56 hours among 
long-HOB babies and 25.48 hours among short-HOB babies. 

How does anticipated length of stay based on hour of birth relate to actual 
length of stay? Table 5 and Figure 1 provide answers. We defined zero days 
as less than 12 hours, one day as between 12 and 36 hours, two days as 
between 36 and 60 hours, and so on, in effect rounding to the nearest 24
hour unit. In Figure 1, the boxplots on the left for anticipated lengths of 
stay have collapsed into lines because the medians and quartiles are equal: 
typically, long-HOB babies were anticipated to stay two days and short- 
HOB babies were anticipated to stay one day. On the right in Figure 1, 
anticipation often but not always equaled actuality: the median and one 

Table 5
Actual days in the hospital in matched pairs. The table counts pairs, not babies

                      Short-HOB baby
Long-HOB baby      ≤1 day     2 days    ≥3 days
≤1 day             27,687       8704       1684
2 days             18,746     16,732       2061
≥3 days              2443       1926        477



[Figure 1: two panels of boxplots, "Forecast LOS" (left) and "Actual LOS" (right), each comparing the Long-HOB and Short-HOB matched groups.]

Fig. 1. Anticipated and actual length of stay (LOS) in days in 80,600 matched pairs of
a long-HOB baby and a short-HOB baby. The anticipated LOS for baby ij is the median
LOS for all babies with the same hour of birth (HOB) as baby ij. The figure on the left
shows that babies in the long-HOB group were typically anticipated to stay two days (36-60
hours) while babies in the short-HOB group were anticipated to stay one day (12-36 hours).
The figure on the right is actual length of stay.

quartile equaled the anticipated stay. Presumably, the decision to keep a 
baby in the hospital for four or more days in Figure 1 is not driven by the 
idiosyncrasy of hour of birth, but rather by serious health problems of the 
newborn. Table 5 describes the actual length of stay in pairs. Because babies 
were paired for important prognostic variables such as birth weight, it is not 
surprising that the two babies in a pair often stayed the same number of days
despite different hours-of-birth. Nonetheless, in a pair, when one baby stayed 
two days and the other stayed one, the odds were 18,746/8704 = 2.2 to 1 
that the long-HOB baby was the one who stayed two days. 
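
The conversion from hours to days and the odds quoted above can be sketched in a few lines. The function name `stay_in_days` is hypothetical, and assigning a stay of exactly 12 or 36 hours to the longer category is an assumption, since the text does not say how boundary values were handled.

```python
def stay_in_days(hours):
    """Round a length of stay in hours to the nearest 24-hour unit."""
    return int((hours + 12) // 24)


# Discordant pairs from Table 5: one baby stayed two days, the other one day or less.
long_stayed_two = 18746   # the long-HOB baby was the one who stayed two days
short_stayed_two = 8704   # the short-HOB baby was the one who stayed two days
odds = long_stayed_two / short_stayed_two
```

With this rule the median stays of 22 and 43 hours quoted in Section 1.2 fall in the one-day and two-day categories, respectively, and the odds round to 2.2.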

Section 3 describes the new techniques used to construct this match and 
Section 4 presents an illustrative analysis of one important outcome, namely, 
readmission to the hospital within two days of discharge. 

3. Using integer programming to construct the matched comparison. 

3.1. Some algorithmic background: Integer versus network optimization. 
An integer programming problem is essentially a linear programming problem
in which the solution is restricted to have integer coordinates rather than
fractional or real coordinates. Often, the solution is further restricted to a
subset of the integers, sometimes to 0 or 1. An excellent introduction to integer
programming is provided by Wolsey (1998) and a more detailed account
is provided by Schrijver (1986). Integer programs arise in various problems 
in operations research because building 5.5 submarines and 6.5 destroyers 
is actually less sensible than building 6 submarines and 6 destroyers or 5 
submarines and 7 destroyers or perhaps 8 submarines and 5 destroyers. In- 
teger programming shows up in optimal matching because whole babies are 
matched to whole babies. Rounding the solution to a linear program may be 
substantially inferior to solving an integer program, but linear programming 
concepts play an important role in solving integer programs. 
An integer program has the form 

(1)    $\min_{a} \; \eta^{T} a$ subject to $Ba \le b$, $a \ge 0$ with $a$ integer,

where $B$ is a given $d_1 \times d_2$ matrix, $\eta$ is a given $d_2$-dimensional vector with
real coordinates, $b$ is a given $d_1$-dimensional vector, and one must find the
best $d_2$-dimensional vector $a$ with $d_2$ integer coordinates. The form (1) simplifies
the discussion in the current section but, in general, integer programs
may include both linear inequality constraints (as in $Ba \le b$) and linear
equality constraints (say, $Ca = c$) and, indeed, with a bit of juggling, either
type of constraint may be reexpressed in terms of the other, so a separate
theory for equality constraints is not needed. In the current paper and in
most matching problems $a$ is further restricted to have binary, 1 or 0,
coordinates. The binary program is finite, with $2^{d_2}$ candidate $a$'s, but
for large $d_2$ the number of candidates suffers a combinatorial explosion and
considering all of them, one by one, is not possible. In the work here, $\eta$
and $a$ have double subscripts, $\eta_{im}$ and $a_{im}$, $i < m$, where $\eta_{im}$ is a measure
of distance on covariates between babies $i$ and $m$, and $a_{im} = 1$ if babies
$i$ and $m$ are paired and $a_{im} = 0$ if they are not. For instance, with $L$ babies,
$a = (a_{12}, a_{13}, a_{23}, \ldots, a_{L-1,L})^{T}$. Then $\eta^{T} a$ is the total covariate distance
within matched pairs. The matrix $B$ imposes various desired restrictions on
the match, not least that each baby shows up in at most one pair.

In (1), if you remove the restriction that a is integral, then you have a 
linear program. The linear program always has a minimizing value of $\eta^{T} a$
that is at least as small as the integer program, but again that leaves you 
with the rather damp prospect of half a submarine. There is a curious but 
important subset of problems in which the linear programming solution and 
the integer programming solution must be the same, and for these problems, 
known somewhat inaccurately as network optimization problems, especially 
fast algorithms are often available by adapting linear programming techniques.
These problems are called "network optimization" because the most
common versions arise from problems expressed in terms of the nodes and
arcs of graph theory. Somewhat more precisely, there is an integral optimal
solution to a linear programming problem if an integer matrix B is
totally unimodular, that is, if every square submatrix of B has determinant
-1, 0, or 1, a condition that ensures via Cramer's rule for matrix inversion
that linear equations solve with integer solutions. See Wolsey [(1998), Section 3.2]
for a precise statement and proof. In R, Hansen's (2007) optmatch
package, Lu et al.'s (2011) nbpmatching package and Yang et al.'s (2012)
finebalance package all use network optimization techniques, specifically
the techniques of Bertsekas (1981) and Derigs (1988). The restriction of B
to be totally unimodular is a substantial restriction, and one can do quite a 
bit more with (1) if B is not so restricted, a fact we demonstrate in detail 
in the current paper. 
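
The definition of total unimodularity can be checked by brute force on very small matrices. The sketch below is an illustration of the definition only: the two toy constraint matrices are made up for this example, the function names are hypothetical, and the exhaustive enumeration of submatrices is hopeless for matrices of realistic size.

```python
from itertools import combinations


def det(m):
    """Determinant of a small square integer matrix by cofactor expansion."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def is_totally_unimodular(b):
    """Brute-force check: every square submatrix has determinant -1, 0 or 1."""
    n_rows, n_cols = len(b), len(b[0])
    for k in range(1, min(n_rows, n_cols) + 1):
        for rows in combinations(range(n_rows), k):
            for cols in combinations(range(n_cols), k):
                sub = [[b[r][c] for c in cols] for r in rows]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


# Rows are units, columns are candidate pairs; an entry is 1 if the unit is in the pair.
# Two-group toy case: units 1, 2 may pair only with units 3, 4 (columns 13, 14, 23, 24).
two_groups = [[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]]
# One-group toy case: three units, any two may pair (columns 12, 13, 23).
one_group = [[1, 1, 0],
             [1, 0, 1],
             [0, 1, 1]]

tu_two_groups = is_totally_unimodular(two_groups)
tu_one_group = is_totally_unimodular(one_group)
```

The two-group matrix passes the check, whereas the three-unit, one-group matrix has determinant -2 and fails it.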

In abstract theory, solving large integer programs can be very difficult. In 
particular, the general problem (1) is NP-complete [Schrijver (1986), Sec- 
tion 18.1]; however, specific forms of (1) are polynomially bounded [e.g., 
Schrijver (1986), Section 18.6]. In practice, there has been a great deal of 
progress in solving quite large integer programs either exactly or approximately.
We use IBM's ILOG CPLEX program to solve (1), and it is much
faster than other programs we have tried. IBM makes CPLEX available to
academics for free. Corrada Bravo (2005) created a package Rcplex that
facilitates access to CPLEX inside R and we have used Rcplex on Apple
and Linux machines. In statistical matching, a common tactic is to match
exactly for a few key covariates [Rosenbaum (2010), Section 9.3] — we did 
this for year and hospital — thereby breaking one large matching problem 
into several smaller ones, each of which can be solved quickly. 

3.2. Nonbipartite matching using integer programming. Generally, we 
wanted to match babies who were similar in terms of covariates but very 
different in terms of anticipated length of stay based on hour of birth. We 
matched exactly for hospital and year of birth, meaning that the two babies 
in a pair were born in the same year at the same hospital. Hospitals vary in 
discharge and readmission practices, so it was important to compare two ba- 
bies in the same hospital. There have been substantial changes in discharge 
and readmission practices over the years, as well as advances in medical 
technique, so matching for year was also important. Exact matching can 
be implemented by simply dividing the population into mutually exclusive 
and exhaustive subpopulations, and performing a separate match for each 
subpopulation. Each subpopulation consisted of a single hospital over an interval
of years. The rest of the discussion describes the match within one such
subpopulation, here a subpopulation defined by hospital and year of birth. 
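
Exact matching by subdivision is simple to express in code. The following is a minimal sketch, assuming each baby is a record with hypothetical fields `id`, `hospital` and `year`; the toy records are invented for illustration.

```python
from collections import defaultdict


def split_into_subproblems(babies):
    """Group baby ids by (hospital, year); each group is matched on its own."""
    groups = defaultdict(list)
    for baby in babies:
        groups[(baby["hospital"], baby["year"])].append(baby["id"])
    return dict(groups)


# Hypothetical toy records.
toy_babies = [
    {"id": 1, "hospital": "A", "year": 1993},
    {"id": 2, "hospital": "A", "year": 1993},
    {"id": 3, "hospital": "A", "year": 1994},
    {"id": 4, "hospital": "B", "year": 1993},
    {"id": 5, "hospital": "B", "year": 1993},
]
subproblems = split_into_subproblems(toy_babies)
```

Because the groups are mutually exclusive and exhaustive, any pair formed inside a group is automatically matched exactly on the grouping variables. A group with a single baby, such as hospital A in 1994 in the toy data, cannot contribute a pair.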

There are $L$ babies in the subpopulation, $i = 1, \ldots, L$, and a variable
$a_{im}$, $1 \le i < m \le L$, with $a_{im} = 1$ if babies $i$ and $m$ are paired and $a_{im} = 0$
otherwise. So $a = (a_{12}, \ldots, a_{L-1,L})^{T}$ has dimension $\binom{L}{2}$ and $B$ has $\binom{L}{2}$
columns. The first constraint is that $a_{im} \in \{0, 1\}$ for all $i, m$, so the problem
is not just an integer program but a binary program. Now each baby $i$
appears in at most one matched pair, and to enforce this, we impose $L$
linear inequalities, $\sum_{m=1}^{\ell-1} a_{m\ell} + \sum_{m=\ell+1}^{L} a_{\ell m} \le 1$, for $\ell = 1, \ldots, L$, which are
coded as the first $L$ rows of $B$, where $b_1 = 1, \ldots, b_L = 1$.
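
The first L rows of B can be written down explicitly for a small L. This is a sketch under stated assumptions: the function names are hypothetical, babies are numbered from 1, and the columns follow the order of the pairs with i < m listed by increasing m.

```python
def pair_index(n_babies):
    """Pairs (i, m) with i < m, ordered (1,2), (1,3), (2,3), ..., (L-1, L)."""
    return [(i, m) for m in range(2, n_babies + 1) for i in range(1, m)]


def at_most_one_pair_rows(n_babies):
    """Rows of B (one per baby) and right-hand sides for the pairing constraint."""
    pairs = pair_index(n_babies)
    rows = [[1 if baby in pair else 0 for pair in pairs]
            for baby in range(1, n_babies + 1)]
    rhs = [1] * n_babies
    return pairs, rows, rhs


def is_feasible(rows, rhs, a):
    """Check the inequalities row by row for a candidate 0/1 vector a."""
    return all(sum(r * x for r, x in zip(row, a)) <= bound
               for row, bound in zip(rows, rhs))


pairs, B_rows, b_rhs = at_most_one_pair_rows(4)
a_valid = [1, 0, 0, 0, 0, 1]    # pairs (1,2) and (3,4)
a_invalid = [1, 1, 0, 0, 0, 0]  # baby 1 used twice, in (1,2) and (1,3)
```

With four babies there are six columns, and each row has a 1 in the three columns whose pair contains that baby. The second candidate violates the row for baby 1, which is how the constraint rules out matching with replacement.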

In statistics, matching is almost invariably "without replacement," meaning
that no baby appears in more than one pair. The constraint
$\sum_{m=1}^{\ell-1} a_{m\ell} + \sum_{m=\ell+1}^{L} a_{\ell m} \le 1$ ensures matching is "without replacement." Because outcome
data are never used in constructing a match, when matching is without
replacement, if the $L$ babies were independent prior to matching, then
the pair outcomes are conditionally independent in distinct pairs given the 
variables used to construct the match, for instance, covariates and hour of 
birth. In contrast, in matching "with replacement," babies would be used 
repeatedly in different pairs, creating dependence. The analysis in Section 4 
uses existing techniques that are appropriate for conditionally independent 
pairs, but these existing techniques are inapplicable when matching "with 
replacement." Indeed, even in the absence of bias from unmeasured covari- 
ates, Abadie and Imbens (2008) argue that straightforward applications of 
the bootstrap are inapplicable when matching "with replacement," and that 
the specialized techniques of Abadie and Imbens (2006) or Politis and Ro- 
mano (1994) are required to obtain a standard error. 

Suppose L is even and one further equality constraint is added, namely,
$L/2 = \sum_{\ell=1}^{L-1}\sum_{m=\ell+1}^{L} a_{\ell m}$ [so B has an $(L+1)$th row consisting of a vector with
$\binom{L}{2}$ coordinates all equal to 1]. Then setting $\eta_{\ell m}$ equal to a covariate distance
between babies $\ell$ and $m$ and solving (1) would yield a minimum distance
nonbipartite match that divides the L babies into L/2 nonoverlapping pairs 
to minimize the total of the L/2 distances within pairs. This optimization 
problem can be solved quickly using network techniques [Derigs (1988)] as 
implemented in R in the nbpmatching package [Lu et al. (2011)]. 
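
As a concrete illustration of the formulation just described, the sketch below enumerates every perfect pairing of a tiny subpopulation, keeps the one with the smallest total distance, encodes it as the binary vector a, and checks the first L rows of B. It is a brute-force stand-in under stated assumptions, not the network algorithm of Derigs (1988) or the nbpmatching implementation; the function names, the four covariate values and the absolute-difference distance are all hypothetical.

```python
from itertools import combinations

def pair_index(L):
    """List the pairs (l, m) with l < m; position in this list indexes a."""
    return list(combinations(range(L), 2))

def one_pair_rows(L):
    """First L rows of B: row l has a 1 in every column whose pair contains l."""
    pairs = pair_index(L)
    return [[1 if l in p else 0 for p in pairs] for l in range(L)]

def perfect_matchings(babies):
    """Yield every division of an even-sized list into unordered pairs."""
    if not babies:
        yield []
        return
    first, rest = babies[0], babies[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in perfect_matchings(remaining):
            yield [(first, partner)] + tail

def min_distance_match(dist):
    """Brute-force minimum total distance pairing (only sensible for tiny L)."""
    L = len(dist)
    best = min(perfect_matchings(list(range(L))),
               key=lambda match: sum(dist[l][m] for l, m in match))
    return best, sum(dist[l][m] for l, m in best)

# Hypothetical covariate values for L = 4 babies; distance = absolute difference.
x = [1.0, 1.2, 5.0, 5.1]
dist = [[abs(u - v) for v in x] for u in x]
match, total = min_distance_match(dist)

# Encode the match as the binary vector a and evaluate the first L rows of B a.
pairs = pair_index(len(x))
a = [1 if p in match else 0 for p in pairs]
B = one_pair_rows(len(x))
row_sums = [sum(b * v for b, v in zip(row, a)) for row in B]
```

Each entry of `row_sums` is the left side of one "at most one pair" inequality, so a valid match has every entry at most 1. Enumeration grows far too quickly for real subpopulations, which is why the text turns to network or integer programming solvers.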

In contrast, the remainder of this section imposes additional constraints as 
additional rows of B to achieve specific effects, and these require the integer 
programming formulation. In Section 3.2.1, the marginal distributions of 
several nominal variables are forced to balance exactly, a condition known as 
fine balance, as seen in Tables 1 and 2. In Section 3.2.2, a binary requirement 
is imposed on pairs, while permitting a small fraction of pairs to escape 
the requirement as needed, a condition which together with fine balance 
produced Table 4 for low birth weight, with perfect balance for marginal 
distributions combined with most pairs exactly matched. Section 3.2.3 forces 
the means of a continuous covariate to balance, as seen for birth weight in 
Table 3, while Section 3.2.5 forces the means of the instrument to differ, 
thereby strengthening the instrument, as seen in Figure 1. Fine balance 
is generalized to near-fine balance in Section 3.2.4. Finally, Section 3.2.7
adjusts $\eta_{\ell m}$ to optimize deletion of some babies while making the remaining
babies closer on covariates and further apart on the instrument.

In teaching, multiple linear regression is defined abstractly, and then spe- 
cific ways of coding its predictor matrix are shown to fit useful models, such 
as polynomials or interactions. In parallel, the integer programming solution 
to nonbipartite matching is best viewed abstractly as (1) with $a_{\ell m} \in \{0,1\}$
and the first L rows of B requiring $\sum_{m=1}^{\ell-1} a_{m\ell} + \sum_{m=\ell+1}^{L} a_{\ell m} \le 1$. Then one
obtains a match that meets specific requirements by suitably adjusting B
and $\eta_{\ell m}$, as described in Sections 3.2.1-3.2.7.

3.2.1. Fine balance. Table 2 exhibits fine balance of the marginal distri- 
butions for the seven nominal variables. Fine balance for a covariate means 
that the marginal distributions of the covariate are exactly the same in 
matched treated and control groups, although individual pairs may not be 
exactly matched for this covariate. If a nominal variable has C categories, it 
is represented as C - 1 binary indicators. Let $w_\ell$ be the binary indicator for
one such category, say, $w_\ell = 1$ if baby $\ell$ is Hispanic and $w_\ell = 0$ if baby $\ell$ is
not Hispanic. Fine balance for this category is the linear equality constraint

(2)    $\sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m}(w_\ell - w_m) = 0.$

Fine balance in Table 2 is actually present in every year in every hospi- 
tal; that is, for instance, among babies born in 2000 in hospital 22, the 
number of Hispanic long-HOB babies equals the number of Hispanic short- 
HOB babies. Fine balance was imposed through several linear equality
constraints of this form. In principle, an equality constraint (2) may be
expressed in the formulation (1) as two inequalities or two rows of B, namely,
$\sum_{\ell=1}^{L-1}\sum_{m=\ell+1}^{L} a_{\ell m}(w_\ell - w_m) \le 0$ and $\sum_{\ell=1}^{L-1}\sum_{m=\ell+1}^{L} a_{\ell m}(w_m - w_\ell) \le 0$;
however, most solvers including CPLEX accept either inequality or equality
constraints. In CPLEX, each fine balance constraint (2) becomes one additional
row of B with an equality constraint.
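
The distinction between fine balance and exact pair matching can be seen in a small sketch. It assumes that each pair is written (l, m) with l < m and that the first member plays one role (say, long-HOB) and the second the other; that ordering convention, the helper names and the six indicator values are hypothetical and only serve the illustration.

```python
from itertools import combinations

def fine_balance_row(w):
    """Coefficients (w_l - w_m) for every pair l < m: one row of B for (2)."""
    return [w[l] - w[m] for l, m in combinations(range(len(w)), 2)]

def is_finely_balanced(match, w):
    """True when the sum of (w_l - w_m) over matched pairs (l, m) is zero."""
    return sum(w[l] - w[m] for l, m in match) == 0

# Hypothetical indicator: w[l] = 1 if baby l belongs to the category.
w = [1, 0, 0, 1, 0, 0]

# Pairs (0,1) and (2,3) are each mismatched on w, yet their imbalances cancel.
balanced_not_exact = [(0, 1), (2, 3), (4, 5)]
# Both category members sit in the first position, so the margins differ.
unbalanced = [(0, 4), (1, 2), (3, 5)]

row = fine_balance_row(w)
ok_first = is_finely_balanced(balanced_not_exact, w)
ok_second = is_finely_balanced(unbalanced, w)
```

Here `ok_first` is true even though two of the three pairs are mismatched on the category, which is the sense in which fine balance constrains marginal distributions rather than individual pairs.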

Treated-versus-control minimum distance matching with fine balance for
one nominal variable, possibly with many levels, was proposed in Rosenbaum 
[(1989), Section 3.2] and Rosenbaum, Ross and Silber (2007) using either 
network optimization or the optimal assignment algorithm; however, that 
approach is not applicable in nonbipartite matching and can only balance 
one nominal variable. In contrast, the integer programming formulation of 
fine balance (2) is applicable to nonbipartite matching while balancing one 
or more variables. 

3.2.2. Binary requirements for individual pairs. Let $h_{\ell m} \in \{0,1\}$ be a
binary variable describing the pairing of two babies, $\ell$ and $m$, where we wish
to sharply limit the number of times that paired babies have $h_{\ell m} = 1$, say,
to at most H pairs. Taking H = 0 requires $h_{\ell m} = 0$ for all paired $\ell$ and $m$,
whereas taking H = 5 permits at most five matched pairs to have $h_{\ell m} = 1$. In
this study, we wanted paired babies to differ substantially in terms of
anticipated length of stay, so we set $h_{\ell m} = 1$ whenever baby $\ell$ had an
anticipated length of stay that was less than 12 hours more than the
anticipated length of stay for baby $m$. The linear inequality constraint

(3)    $\sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m} h_{\ell m} \le H$

is added as a row to B to impose this constraint with H = 0. In addition,
within each hospital in each year, a constraint of the form (3) was used with
$h_{\ell m} = 1$ if babies $\ell$ and $m$ differed in terms of low birth weight < 2500 grams
and H was twenty percent of the number of births in that hospital in that
year.
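
A minimal sketch of constraint (3) for the separation requirement follows. It assumes, as in the text, that a pair (l, m) is flagged when the anticipated stay of baby l exceeds that of baby m by less than 12 hours; the function names and the four anticipated stays are hypothetical.

```python
def h_too_close(stay, l, m, min_gap=12.0):
    """h_lm = 1 when baby l's anticipated stay exceeds baby m's by < min_gap hours."""
    return 1 if stay[l] - stay[m] < min_gap else 0

def violations(match, stay, min_gap=12.0):
    """Left side of (3): number of matched pairs with h_lm = 1."""
    return sum(h_too_close(stay, l, m, min_gap) for l, m in match)

def satisfies_requirement(match, stay, H=0, min_gap=12.0):
    """True when at most H matched pairs violate the separation requirement."""
    return violations(match, stay, min_gap) <= H

# Hypothetical anticipated lengths of stay in hours.
stay = [40.0, 26.0, 38.0, 30.0]
match = [(0, 1), (2, 3)]          # gaps of 14 and 8 hours

n_bad = violations(match, stay)
strict = satisfies_requirement(match, stay, H=0)
relaxed = satisfies_requirement(match, stay, H=1)
```

With H = 0 the second pair rules this match out; with H = 1 the same match is permitted, which mirrors how the low birth weight requirement in the text tolerates a limited number of mismatched pairs.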

3.2.3. Balancing means. For any covariate $v$, not necessarily a binary
covariate, suppose that we wish to ensure that the means in matched treated
and control groups differ by at most a number $\varepsilon > 0$. Unlike a binary
covariate in (2), for a continuous covariate such as birth weight, one cannot
reasonably take $\varepsilon = 0$. Because there are $\sum_{\ell=1}^{L-1}\sum_{m=\ell+1}^{L} a_{\ell m}$ matched pairs,
this requirement is the same as

(4)    $\left| \sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m} v_\ell - \sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m} v_m \right| \le \varepsilon \sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m}.$

Now, because of the absolute values in the constraint (4), this constraint
is not one linear inequality. However, requiring (4) to hold is equivalent to
requiring two linear inequalities to both hold, namely,

(5)    $\sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m}(v_\ell - v_m - \varepsilon) \le 0$  and  $\sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m}(v_m - v_\ell - \varepsilon) \le 0.$

So, a requirement that the means of v after matching differ by at most e is 
represented in the integer program as two rows of the matrix B. Notice in 
Table 3 that the mean of birth weight is almost the same for the long-HOB 
and short-HOB babies. The same technique was applied to birth injury and 
oligohydramnios in Table 3. 
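
The equivalence between the absolute-value requirement and the pair of linear inequalities can be checked numerically. The sketch below assumes pairs are written (l, m) as in the text; the function names and the six birth weights (in grams) are hypothetical, and integer values are used so the comparison is free of rounding.

```python
def mean_gap_ok(match, v, eps):
    """Absolute-value form: |sum of v_l - sum of v_m| <= eps * (number of pairs)."""
    diff = sum(v[l] - v[m] for l, m in match)
    return abs(diff) <= eps * len(match)

def two_rows_ok(match, v, eps):
    """Linear form: both one-sided inequalities hold simultaneously."""
    row1 = sum(v[l] - v[m] - eps for l, m in match)
    row2 = sum(v[m] - v[l] - eps for l, m in match)
    return row1 <= 0 and row2 <= 0

# Hypothetical birth weights in grams and a hypothetical match.
v = [2600, 2550, 2400, 2480, 3000, 2900]
match = [(0, 1), (2, 3), (4, 5)]      # within-pair differences 50, -80, 100

results = {eps: (mean_gap_ok(match, v, eps), two_rows_ok(match, v, eps))
           for eps in (10, 20, 25, 50)}
```

The summed difference is 70 grams over three pairs, so the requirement fails for a tolerance of 20 grams and holds for 25 grams, and the two formulations agree at every tolerance tried.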

3.2.4. Near-fine balance. Sometimes fine balance (2) for a binary variable
$w$ is infeasible or just too restrictive. For bipartite matching, Yang et al.
(2012) proposed a network optimization algorithm for treatment-versus-control
near-fine balance requiring $\left|\sum_{\ell=1}^{L-1}\sum_{m=\ell+1}^{L} a_{\ell m}(w_\ell - w_m)\right| \le \varepsilon$ rather
than (2) for the binary variables $w$ that define categories of a single
nominal variable, and Yang implemented this in her finebalance package in R
which uses network optimization. Just as (4) became two linear inequalities
in (5), so too $\left|\sum_{\ell=1}^{L-1}\sum_{m=\ell+1}^{L} a_{\ell m}(w_\ell - w_m)\right| \le \varepsilon$ may be split into two linear
inequality constraints which are imposed using integer programming. Also,
unlike network optimization, integer programming permits near-fine balance
for one or more nominal variables in nonbipartite matching.

3.2.5. Forcing pairs to differ with respect to the mean of the instrument.
Although we set a minimum requirement of a 12 hour difference in anticipated
length of stay using a constraint of the form (3), we wanted the typical
difference to be larger than the minimum. Specifically, we imposed the
requirement that the mean difference in anticipated length of stay, say, $v_\ell$,
should be at least $\phi = 13$ hours, that is, we required

$\sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m} v_\ell - \sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m} v_m \ge \phi \sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m}$

by imposing the linear inequality constraint

$\sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m}(v_\ell - v_m - \phi) \ge 0.$

In Table 3, the anticipated length of stay based on birth hour is 39.56 hours 
for the long-HOB babies and 25.48 hours for the short-HOB babies, an 
anticipated difference of more than 14 hours. 

3.2.6. Using several techniques to balance one covariate. It is possible 
to use several of these devices for the same variable. Birth weight is an 
especially important prognostic variable. We finely balanced the indicator 
of birth weight < 2500 grams in Table 2 using a constraint of the form (2). 
We limited the difference in means of birth weight in Table 3 using a pair of 
constraints of the form (5), and we limited the number of times individual 
pairs $(\ell, m)$ were mismatched for the indicator of birth weight < 2500 grams
using a constraint of the form (3). 

3.2.7. Optimal selection of a subset. Recall that our match discards some
babies and must optimally decide the following: (i) how many babies to
discard, (ii) which babies to discard, and (iii) how to pair the babies not
discarded. Extending the technique in Rosenbaum (2012) to nonbipartite
matching, the objective function $\boldsymbol{\eta}^T \mathbf{a}$ is

(6)    $\sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m} \nu_{\ell m} - \lambda \sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m}$

or $\eta_{\ell m} = \nu_{\ell m} - \lambda$, where $\nu_{\ell m}$ is a robust Mahalanobis distance between the
covariates for babies $\ell$ and $m$, and $\lambda$ is a constant selected by the
investigator. For discussion of the use of Mahalanobis distances in matching,
see Rubin (1980), and for a robust Mahalanobis distance, see Rosenbaum
(2010), Sections 8.3 and 13.11. Because $\sum_{\ell=1}^{L-1}\sum_{m=\ell+1}^{L} a_{\ell m}$ is the number of
matched pairs, the objective function (6) has the following interpretation.
When comparing two possible matched samples, say, $a_{\ell m}$ and $a'_{\ell m}$, that
satisfy the constraints with the same number of pairs, (6) prefers the pairing
with the smaller total distance within pairs. Suppose, instead, $a_{\ell m}$ includes
$\Delta > 0$ more pairs than $a'_{\ell m}$, $\Delta = \sum_{\ell=1}^{L-1}\sum_{m=\ell+1}^{L} (a_{\ell m} - a'_{\ell m})$. Then (6) prefers
$a'_{\ell m}$ to $a_{\ell m}$ if

$\sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m} \nu_{\ell m} - \lambda \sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m} > \sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a'_{\ell m} \nu_{\ell m} - \lambda \sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a'_{\ell m}$

or, equivalently, if

(7)    $\dfrac{\sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a_{\ell m} \nu_{\ell m} - \sum_{\ell=1}^{L-1} \sum_{m=\ell+1}^{L} a'_{\ell m} \nu_{\ell m}}{\Delta} > \lambda.$

In words, the match represented by $a_{\ell m}$ had $\Delta$ pairs more than the match
$a'_{\ell m}$, so the sum of the distances $\nu_{\ell m}$ for $a_{\ell m}$ contained $\Delta$ more distances,
and the total distance within pairs rose by more than $\Delta\lambda$ if (7) holds, so the
average cost of these $\Delta$ additional pairs was more than $\lambda$. The objective (6)
prefers more pairs to fewer pairs if, on average, more pairs may be had for less
than $\lambda$ and prefers fewer pairs if, on average, they cost more than $\lambda$. Because
$a_{\ell m}$ and $a'_{\ell m}$ pair babies differently, the change in average cost is produced
by all of the paired babies, not just $\Delta$ babies; see Rosenbaum (2012) for
detailed discussion. In our case, $\lambda$ was the median of all distances before
matching, and the algorithm prefers more pairs to fewer pairs providing
the added pairs are, on average, closer than pairs typically are. Of 231,831
babies, this value of $\lambda$ paired 161,200 babies. Although it would be possible
to pair additional babies, each of these additions would, on average, raise the
distance by more than $\lambda$, that is, by more than the median pairwise distance
before matching. One might choose a different $\lambda$ in a different context.
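
The way the penalty in objective (6) trades the number of pairs against their closeness can be seen by brute force on a toy example. The sketch below enumerates every set of disjoint pairs, allowing babies to be discarded, and minimizes total distance minus the penalty times the number of pairs. It is an illustration under stated assumptions rather than the integer program solved in the study; the function names, the four covariate values, the absolute-difference distance and the three penalty values are hypothetical.

```python
def partial_matchings(babies):
    """Yield every set of disjoint pairs, allowing some babies to go unmatched."""
    if len(babies) < 2:
        yield []
        return
    first, rest = babies[0], babies[1:]
    # Option 1: discard the first baby.
    yield from partial_matchings(rest)
    # Option 2: pair the first baby with each later baby in turn.
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in partial_matchings(remaining):
            yield [(first, partner)] + tail

def objective(match, dist, lam):
    """Total within-pair distance minus lam times the number of pairs."""
    return sum(dist[l][m] for l, m in match) - lam * len(match)

def best_subset_match(dist, lam):
    """Brute-force minimizer of the penalized objective (tiny L only)."""
    return min(partial_matchings(list(range(len(dist)))),
               key=lambda match: objective(match, dist, lam))

# Hypothetical covariate values; babies 0 and 1 are close, 2 and 3 less so.
x = [0.0, 1.0, 10.0, 14.0]
dist = [[abs(u - v) for v in x] for u in x]

small = best_subset_match(dist, 0.5)   # penalty below every pairwise distance
medium = best_subset_match(dist, 2.0)  # penalty between distances 1 and 4
large = best_subset_match(dist, 5.0)   # penalty above distances 1 and 4
```

With the smallest penalty no pair is worth forming, with the middle penalty only the close pair is kept, and with the largest penalty both pairs are kept, which matches the interpretation that a pair is added only when it costs less than the penalty on average.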

3.3. Comparison with three other matched samples. Table 6 compares
the match described in Section 2 with three other sets of matched pairs.
As noted in Section 3.2, the match in Section 2 insisted on a separation of
12 hours in anticipated length-of-stay within each pair. Table 6 contrasts
matching with 12 hour separation to matching with no required separation,
>9 hours and >15 hours. Two quantities are reported in Table 6: the number
of pairs and the percent of babies staying more than one day, where one day
is a length-of-stay between 12 and 36 hours. With 0 separation, there is only
a 6.2% difference between long-HOB and short-HOB births in stays more
than one day. With 12 hours of separation, the difference is more than twice
as large, 13.4%. In the terminology of Angrist, Imbens and Rubin (1996),
the percent of compliers is estimated to be more than twice as large with 12
hours of separation as with 0 separation.

Table 6
Comparison of four matched samples with different required differences in
anticipated length of stay. The actual match required a 12 hour difference in
anticipated length of stay. This 12 hour required difference is compared with
0, 9 and 15 hours. The table records the percent of babies staying longer than
one day and the number of pairs. Because zero days is a length of stay less
than 12 hours, and one day is a length of stay greater than 12 hours but less
than 36 hours, the table indicates the percent of babies staying longer than
36 hours

Separation in anticipated LOS (hours)        0        9       12       15
Long-HOB %                                46.9     50.2     52.7     58.7
Short-HOB %                               40.7     38.0     39.2     41.8
Difference %                               6.2     12.1     13.4     16.9
Number of pairs                         91,053   90,360   80,600   59,678

Matching is part of the design of an observational study, a task that 
should be completed before outcomes are examined [Langenskiold and Rubin 
(2008), Rosenbaum (2010)], and, in particular, one matched sample should 
be selected as the design without using or examining outcomes. We selected 
the 12 hour match based on its qualities as a matched comparison, for in- 
stance, the covariate balance in Tables 1-5 and Figure 1, and the number of 
pairs and instrument strength in Table 6. The analysis of outcomes for this 
selected match is discussed in Section 4. 

4. Inference: Effects on rapid readmission. 

4.1. Null hypotheses of no effect or substantial inequivalence. We will
conduct both a test of no effect and an equivalence test for readmissions
within two days of discharge from the hospital. That is, we wish to ask
whether our data are compatible with no effect or substantial effects of
shifting the norm for length of stay. Following Bauer and Kieser (1996), a
three part null hypothesis is tested, where one part asserts no effect, a second
part asserts moderately large benefits from a 2-day norm and the third part
asserts moderately large benefits from a 1-day norm. Because these three
null hypotheses are logically incompatible with one another, at most one of
the null hypotheses is true, so all three hypotheses may be tested without
a correction for testing multiple hypotheses; see Bauer and Kieser (1996).
In particular, the hypothesis of no effect is a two-sided hypothesis saying
changing the hour of birth for a baby would not change whether the baby
is readmitted within two days of discharge. The hypothesis that a norm
of a one-day length-of-stay is harmful asserts that it caused at least 500
readmissions that would not have occurred with a two-day norm. Because
there are 80,600 pairs in Table 7, each pair containing one short-HOB baby,
500 readmissions is slightly more than one half of one percent of these babies
(actually 500/80,600 = 0.00620). In Table 5, 18,746 - 8704 = 10,042 more
long-HOB babies stayed 2 days rather than 1 day, and 500 babies is about
5% of these 10,042 babies (actually 500/10,042 = 0.0498). The same value,
500, is used to test the third hypothesis of substantial harm, rather than
substantial benefit, from a two-day norm. In testing these hypotheses, we
are concerned about both sampling variability and bias from nonrandom
treatment assignment.

Table 7
Readmission within two days of discharge in matched pairs. The table counts
pairs, not babies

Observed data
                          Short-HOB baby
Long-HOB baby        Not readmitted    Readmitted
Not readmitted               78,431          1032
Readmitted                     1108            29

4.2. Randomization inference in matched pairs: Viewing hour of birth as
random. There are $I$ matched pairs, $i = 1,\ldots,I$, of two babies, $j = 1,2$,
one treated, $Z_{ij} = 1$, the other control, $Z_{ij} = 0$, so $Z_{i1} + Z_{i2} = 1$ for each
$i$. In Section 1.2, there were $I = 80{,}600$ pairs of babies, or 2 x 80,600 =
161,200 babies in total, and somewhat arbitrarily we designate short-HOB
as treatment and long-HOB as control. Babies were matched for an observed
covariate $\mathbf{x}_{ij}$, so $\mathbf{x}_{i1} = \mathbf{x}_{i2}$ for all $i$, but they may have differed in terms of an
unmeasured covariate $u_{ij}$, so quite possibly $u_{i1} \ne u_{i2}$ for many or all $i$. Write
$\mathbf{Z} = (Z_{11},\ldots,Z_{I2})^T$ for the $2I$-dimensional vector of treatment assignments
and write $\mathcal{Z}$ for the set containing the $2^I$ possible values $\mathbf{z}$ of $\mathbf{Z}$, so $\mathbf{z} \in \mathcal{Z}$
if $\mathbf{z} = (z_{11},\ldots,z_{I2})^T$ with $z_{ij} = 0$ or $z_{ij} = 1$ and $z_{i1} + z_{i2} = 1$ for each $i$. If
$S$ is a finite set, write $|S|$ for the number of elements of $S$, so $|\mathcal{Z}| = 2^I$.
Conditioning on the event $\mathbf{Z} \in \mathcal{Z}$ is abbreviated to conditioning on $\mathcal{Z}$.

Each baby has two potential binary 1 or 0 responses, $r_{Tij}$ if treated, $r_{Cij}$ if
control, so the effect of the treatment on this baby, namely, $\delta_{ij} = r_{Tij} - r_{Cij}$,
is not seen for any baby $ij$ but the response actually seen from $ij$ is $R_{ij} =
Z_{ij} r_{Tij} + (1 - Z_{ij}) r_{Cij} = r_{Cij} + Z_{ij}\delta_{ij}$; see Neyman (1923), Welch (1937),
Rubin (1974), Reiter (2000) or Gadbury (2001). Write $\mathbf{R} = (R_{11},\ldots,R_{I2})^T$,
$\boldsymbol{\delta} = (\delta_{11},\ldots,\delta_{I2})^T$, $\mathbf{r}_C = (r_{C11},\ldots,r_{CI2})^T$, $\mathbf{r}_T = (r_{T11},\ldots,r_{TI2})^T$, so $\mathbf{r}_T =
\mathbf{r}_C + \boldsymbol{\delta}$. Here, $\delta_{ij} \in \{-1,0,1\}$ for each $ij$ and Fisher's (1935) sharp null
hypothesis $H_0$ of no treatment effect asserts that $H_0 : \boldsymbol{\delta} = \mathbf{0}$. In the discussion
here, $R_{ij}$ indicates whether baby $ij$ was readmitted, $R_{ij} = 1$, or not, $R_{ij} = 0$,
within two days of discharge from the hospital. If $r_{Tij} = 1$ and $r_{Cij} = 0$ so
$\delta_{ij} = r_{Tij} - r_{Cij} = 1$, then baby $ij$ would have been readmitted if born at an
hour that would typically lead to a one-day stay and would not have been
readmitted if born at an hour that would typically lead to a two-day stay,
so being born at a short-HOB rather than a long-HOB would have caused
this baby to be readmitted. Aside from Fisher's null hypothesis of no effect,
greatest interest attaches to hypotheses in which one treatment may cause
but does not prevent a readmission, $H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$ with $\boldsymbol{\delta}_0 \ge \mathbf{0}$ and $\boldsymbol{\delta}_0 \ne \mathbf{0}$,
because hypotheses of this form say that one treatment is clearly better
than the other. Write $\mathcal{F} = \{(r_{Tij}, r_{Cij}, \mathbf{x}_{ij}, u_{ij}), i = 1,\ldots,I, j = 1,2\}$ for the
potential responses and covariates.

In a paired randomized experiment, one baby in each pair would be
picked at random for treatment, the other baby receiving control, with
independent assignments in distinct pairs, that is, $\Pr(\mathbf{Z} = \mathbf{z} \mid \mathcal{F}, \mathcal{Z}) = 2^{-I}$ for
$\mathbf{z} \in \mathcal{Z}$. In Section 1.2, hour-of-birth is not randomized, but because hour
of birth should not pick out a particular type of baby, the hope is that
$\Pr(\mathbf{Z} = \mathbf{z} \mid \mathcal{F}, \mathcal{Z})$ is close to the randomization distribution. Section 4.3
examines the sensitivity of conclusions to departures of various magnitudes from
$\Pr(\mathbf{Z} = \mathbf{z} \mid \mathcal{F}, \mathcal{Z}) = 2^{-I}$.

The statistic $T = \sum_{i=1}^{I}\sum_{j=1}^{2} Z_{ij}R_{ij}$ is the observed number of readmissions
within two days among babies born at a short-HOB. Some of the
readmissions recorded in $T$ may have been caused by the short-HOB and
others might have occurred whether the baby was born at a short or a
long HOB. The unobservable quantity $T_C = \sum_{i=1}^{I}\sum_{j=1}^{2} Z_{ij}r_{Cij}$ is the number
of readmissions that would have occurred had all babies been born at
a long-HOB. Fisher's sharp null hypothesis, $H_0 : \boldsymbol{\delta} = \mathbf{0}$, says that no
readmission was caused or prevented by the hour of birth, with the consequence
that $T = T_C$. Consider the distribution of $T_C$ in a randomized experiment,
that is, $\Pr(T_C \le k \mid \mathcal{F}, \mathcal{Z})$ when $\Pr(\mathbf{Z} = \mathbf{z} \mid \mathcal{F}, \mathcal{Z}) = 2^{-I}$. Define $n_{11}$ to be the
number of pairs $i$ with $r_{Ci1} = r_{Ci2} = 1$, $n_{00}$ to be the number of pairs with
$r_{Ci1} = r_{Ci2} = 0$, and $n_{10}$ to be the number of pairs with $r_{Ci1} \ne r_{Ci2}$. If
$H_0 : \boldsymbol{\delta} = \mathbf{0}$ were true, then $R_{ij} = r_{Cij}$ and it would be possible to calculate
$(n_{11}, n_{10}, n_{00})$ from the observed $R_{ij}$'s. Because $\Pr(\mathbf{Z} = \mathbf{z} \mid \mathcal{F}, \mathcal{Z}) = 2^{-I}$ and
$\mathbf{r}_C$ is fixed by conditioning on $\mathcal{F}$, the $I$ terms $\sum_{j=1}^{2} Z_{ij}r_{Cij}$ are independent
for distinct $i$, and $\sum_{j=1}^{2} Z_{ij}r_{Cij}$ is 1 with certainty if the pair is concordant
with $r_{Ci1} = r_{Ci2} = 1$, is 0 with certainty if the pair is concordant with
$r_{Ci1} = r_{Ci2} = 0$, and is 1 or 0 each with probability 1/2 if the pair is discordant
with $r_{Ci1} \ne r_{Ci2}$; therefore, $T_C$ is the constant $n_{11}$ plus a binomial random
variable with probability of success 1/2 and sample size $n_{10}$. Because $T = T_C$
when Fisher's sharp null hypothesis $H_0 : \boldsymbol{\delta} = \mathbf{0}$ is true, it follows that $H_0$
may be tested in a randomized experiment by comparing $T$ with the
randomization distribution of $T_C$, and this is essentially the same as McNemar's
test.
Let $\boldsymbol{\delta}_0$ be a $2I$-dimensional vector with coordinates $\delta_{0ij} \in \{-1, 0, 1\}$, and consider
the hypothesis $H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$. Not all hypotheses of this form are logically
compatible with the observed data because $R_{ij} - Z_{ij}\delta_{0ij} = r_{Cij}$ and $R_{ij} +
(1 - Z_{ij})\delta_{0ij} = r_{Tij}$ must both be in $\{0, 1\}$. If $H_{\boldsymbol{\delta}_0}$ is logically incompatible
with the data, we may reject it with type 1 error rate of zero, so for the
remainder of the discussion, assume that $H_{\boldsymbol{\delta}_0}$ is logically compatible with
the observed data, or briefly compatible. If $H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$ were true (and hence
compatible), then $r_{Cij} = R_{ij} - Z_{ij}\delta_{0ij}$ may be calculated from the hypothesis
and the data, so $n_{11}$, $n_{10}$, $n_{00}$ and $T_C$ may be calculated as well, so $T_C$
may be compared with the constant-plus-binomial distribution to test $H_{\boldsymbol{\delta}_0}$.
Unfortunately, there are many hypotheses $H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$ and it is not practical
to test them all; however, the testing of many hypotheses $H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$ may
be summarized using a scalar quantity, the attributable effect.

The attributable effect $A = \sum_{i=1}^{I}\sum_{j=1}^{2} Z_{ij}\delta_{ij}$ is an unobservable quantity
giving the net increase in the number of babies readmitted because they
were born at a short-HOB; see Rosenbaum (2002a). It is a random variable
because it depends upon $\mathbf{Z}$, but it is not an observable random variable
because it depends on $\boldsymbol{\delta}$. Among babies born at a short-HOB, we see $T =
\sum_{i=1}^{I}\sum_{j=1}^{2} Z_{ij}R_{ij} = \sum_{i=1}^{I}\sum_{j=1}^{2} Z_{ij}r_{Tij}$ readmissions, whereas these same
babies would have had $T_C = \sum_{i=1}^{I}\sum_{j=1}^{2} Z_{ij}r_{Cij}$ readmissions had they been
born at a long-HOB. If $H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$ were true, then $A$ may be calculated
using the hypothesized $\boldsymbol{\delta}_0$ as $A_0 = \sum_{i=1}^{I}\sum_{j=1}^{2} Z_{ij}\delta_{0ij}$, and $T - A_0$ would
equal $T_C$.

For the reason noted above, we consider hypotheses $H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$ that say
that one treatment is better than the other in the sense that $\boldsymbol{\delta}_0 \ge \mathbf{0}$ and
$\boldsymbol{\delta}_0 \ne \mathbf{0}$. We will do this twice, once reversing the roles of treatment and
control, but for the moment consider the hypothesis that a short-HOB may
cause but not prevent readmissions in the sense that $\boldsymbol{\delta}_0 \ge \mathbf{0}$. A value of
$A_0$ is rejected if every hypothesis $H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$ with $\boldsymbol{\delta}_0 \ge \mathbf{0}$ and $\boldsymbol{\delta}_0 \ne \mathbf{0}$ that
gives rise to this value of $A_0 = \sum_{i=1}^{I}\sum_{j=1}^{2} Z_{ij}\delta_{0ij}$ is rejected; otherwise,
this value of $A_0$ is not rejected. For all of these hypotheses, $T - A_0 = T_C$
will be the same number; however, $n_{11}$, $n_{10}$ and $n_{00}$ typically change with
$\boldsymbol{\delta}_0$. For a given $A_0$, among all hypotheses $H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$ with $\boldsymbol{\delta}_0 \ge \mathbf{0}$ and
$\boldsymbol{\delta}_0 \ne \mathbf{0}$ that yield the same attributable effect $A_0$, there is one hypothesis
$H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$ with $A_0 = \sum_{i=1}^{I}\sum_{j=1}^{2} Z_{ij}\delta_{0ij}$ that is the most difficult to reject,
so if $H_{\boldsymbol{\delta}_0}$ is rejected, then the associated value of $A_0$ is rejected. In a cohort
study, as in Section 1.2, this hypothesis $H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$ has $\sum_{j=1}^{2} Z_{ij}\delta_{0ij} = 1$
for as many pairs with $R_{i1} + R_{i2} = 2$ as possible; see Rosenbaum [(2002a),
Section 6] for a precise statement and proof. For instance, if Table 7 had come
from a randomized experiment, $\Pr(\mathbf{Z} = \mathbf{z} \mid \mathcal{F}, \mathcal{Z}) = 2^{-I}$, then $A_0 \ge 500$ would
be rejected if McNemar's one-sided test rejected no effect in the adjusted
Table 8, where all 29 pairs with $R_{i1} + R_{i2} = 2$ have $\delta_{0ij} = 1$ and 471 pairs
with $R_{i1} + R_{i2} = 1$ have $\delta_{0ij} = 1$. Why is this $H_{\boldsymbol{\delta}_0}$ the hypothesis that is most
difficult to reject among hypotheses with $A_0 \ge 500$? Intuitively, this $H_{\boldsymbol{\delta}_0}$ has
$A_0 = 500$ with the most variability because the number of discordant pairs
$n_{10}$ is as large as possible; see Rosenbaum [(2002a), Section 6] for precise
discussion.

Table 8
Readmission within two days of discharge in matched pairs adjusted for a
null hypothesis $H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$ that attributes $\sum \delta_{ij} Z_{ij} = A = 500$ readmissions
to early discharge

Data adjusted for $H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$
                          Short-HOB baby
Long-HOB baby        Not readmitted    Readmitted
Not readmitted               78,902           561
Readmitted                     1137             0
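
The step from Table 7 to Table 8 is simple bookkeeping, sketched below. The function removes the hypothesized attributable readmissions from short-HOB babies, first from pairs in which both babies were readmitted and then from pairs in which only the short-HOB baby was readmitted, as the text describes. The function name and dictionary keys are hypothetical, and the assignment of the counts 1032 and 1108 to the two discordant cells follows the reading of Table 7 that is consistent with the totals in Table 8.

```python
def adjust_table(both_not, short_only, long_only, both, attributable):
    """Adjust a paired 2x2 table for a hypothesis attributing `attributable`
    readmissions of short-HOB babies to the short-HOB itself."""
    # Use up the pairs with two readmissions first (the hardest-to-reject case).
    from_concordant = min(both, attributable)
    from_discordant = attributable - from_concordant
    if from_discordant > short_only:
        raise ValueError("hypothesis is incompatible with the observed table")
    return {
        # Short-HOB baby no longer readmitted, long-HOB baby never was.
        "both_not": both_not + from_discordant,
        "short_only": short_only - from_discordant,
        # Short-HOB baby no longer readmitted, long-HOB baby still is.
        "long_only": long_only + from_concordant,
        "both": both - from_concordant,
    }

# Counts of pairs from Table 7.
table8 = adjust_table(both_not=78431, short_only=1032, long_only=1108,
                      both=29, attributable=500)
n_discordant = table8["short_only"] + table8["long_only"]
```

Using the concordant pairs first moves 29 pairs into the discordant cells, which is what makes the number of discordant pairs as large as possible.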

If Table 7 had been seen in a randomized experiment, $\Pr(\mathbf{Z} = \mathbf{z} \mid \mathcal{F}, \mathcal{Z}) =
2^{-I}$, then the procedure just described would yield the following conclusions.
Testing the null hypothesis of no effect, $H_0 : \boldsymbol{\delta} = \mathbf{0}$, yields a two-sided
P-value of 0.105 using McNemar's two-sided test, so no effect is plausible.
Is a substantial benefit of $A_0 = 500$ from being born at a long-HOB also
plausible? It is not. McNemar's one-sided test rejects in Table 8 with
P-value 2.1 x 10^-[illegible], so it rejects for every $H_{\boldsymbol{\delta}_0} : \boldsymbol{\delta} = \boldsymbol{\delta}_0$ with $\boldsymbol{\delta}_0 \ge \mathbf{0}$ and $\boldsymbol{\delta}_0 \ne \mathbf{0}$
and $A_0 \ge 500$. Reversing the roles of (and notation for) a short-HOB and
a long-HOB, a substantial benefit of $A_0 = 500$ from being born at a
short-HOB is rejected with a P-value 2.9 x 10^-[illegible]. In brief, if Table 7 had been seen
in a randomized experiment, the hypothesis of no effect would be plausible,
whereas a benefit or harm that affected at least one half of one percent of
babies would not be remotely plausible. Of course, Table 7 is not from a
randomized experiment.
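
The randomization P-values in this paragraph come from binomial tails with success probability 1/2 over the discordant pairs. A minimal sketch follows, using exact integer arithmetic because 0.5 raised to the number of discordant pairs underflows in floating point. The function names are hypothetical, and the discordant counts are taken from Tables 7 and 8 as printed; the two-sided value should land close to the 0.105 quoted in the text, while the one-sided value is left for the reader to inspect rather than asserted here.

```python
from math import comb

def upper_tail_half(k, n):
    """P(X >= k) for X ~ Binomial(n, 1/2), via exact integer arithmetic."""
    return sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n

def lower_tail_half(k, n):
    """P(X <= k) for X ~ Binomial(n, 1/2), via exact integer arithmetic."""
    return sum(comb(n, j) for j in range(0, k + 1)) / 2 ** n

# Table 7 has 1032 + 1108 = 2140 discordant pairs. By symmetry, doubling the
# upper tail at the larger count gives the two-sided P-value for no effect.
p_no_effect = min(1.0, 2 * upper_tail_half(1108, 1032 + 1108))

# Table 8 has 561 + 1137 = 1698 discordant pairs. The one-sided test asks how
# likely it is to see 561 or fewer short-HOB readmissions among them.
p_adjusted = lower_tail_half(561, 561 + 1137)
```

Python's integer division to a float stays accurate for these very large integers, so no special-purpose statistics library is needed for the check.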

Our hope has been that a baby's hour of birth tells you little or nothing 
about the baby and her mother, that is, our hope was that hour of birth 
was nearly random, at least after matching for covariates. We cannot be
certain of this, however. It is possible to use drugs to induce or accelerate
labor, and perhaps the use of such drugs shifts the hour of delivery for some 
mothers, possibly in a fashion that biases randomization inferences based 
on Table 7. Moreover, the distribution of times for vaginal delivery may be 
affected by cesarean sections, which again may be related to aspects of the 
mother or the hospital. How large would such biases have to be to alter the 
qualitative conclusions based on randomization inferences? This is examined 
in Section 4.3 using a sensitivity analysis. 

4.3. Sensitivity analysis in matched pairs: What if birth hour is not ran- 
dom? The assumption in Section 4.2 was that hour of birth is effectively 
random, that it tells you nothing about the baby or the mother or the hospital
and its staff, so that $\Pr(\mathbf{Z} = \mathbf{z} \mid \mathcal{F}, \mathcal{Z}) = 2^{-I}$ for $\mathbf{z} \in \mathcal{Z}$. The current section
studies sensitivity of the conclusions to quantified violations of this assump- 
tion. The model (9) for sensitivity analysis used here is discussed in Rosen- 
baum (2002b), Section 4. Other methods of sensitivity analysis in observa- 
tional studies are discussed by Cornfield et al. (1959), Rosenbaum and Rubin 
(1983), Yanagawa (1984), Gastwirth (1992), Marcus (1997), Imbens (2003), 
Diprete and Gangl (2004), Yu and Gastwirth (2005), Wang and Krieger 
(2006), McCandless, Gustafson and Levy (2007), Egleston, Scharfstein and 
MacKenzie (2009) and Hosman, Hansen and Holland (2010), among others. 

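One model for sensitivity analysis in observational studies asserts that,
in the population before matching, treatment assignments are independent
and two babies, say, $ij$ and $ij'$, with the same observed covariates, $\mathbf{x}_{ij} = \mathbf{x}_{ij'}$,
may differ in their odds of treatment by at most a factor of $\Gamma \ge 1$,

(8)    $\dfrac{1}{\Gamma} \le \dfrac{\Pr(Z_{ij} = 1 \mid \mathcal{F}) \Pr(Z_{ij'} = 0 \mid \mathcal{F})}{\Pr(Z_{ij} = 0 \mid \mathcal{F}) \Pr(Z_{ij'} = 1 \mid \mathcal{F})} \le \Gamma;$

then the distribution of $\mathbf{Z}$ is returned to $\mathcal{Z}$ by conditioning on $\mathbf{Z} \in \mathcal{Z}$. Model
(8) is similar to the sensitivity analysis of Cornfield et al. (1959) and is
exactly the same as assuming that

(9)    $\Pr(\mathbf{Z} = \mathbf{z} \mid \mathcal{F}, \mathcal{Z}) = \prod_{i=1}^{I} \dfrac{\exp\{\gamma(z_{i1}u_{i1} + z_{i2}u_{i2})\}}{\exp(\gamma u_{i1}) + \exp(\gamma u_{i2})}$

(10)    $= \dfrac{\exp(\gamma \mathbf{z}^T \mathbf{u})}{\sum_{\mathbf{b} \in \mathcal{Z}} \exp(\gamma \mathbf{b}^T \mathbf{u})}$  with $\mathbf{u} \in [0,1]^{2I}$ and $\gamma = \log(\Gamma)$;

see Rosenbaum [(2002b), Section 4] where $u_{ij}$ satisfying (9) is constructed
from $\Pr(Z_{ij} = 1 \mid \mathcal{F})$ satisfying (8) and conversely.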

Using either of the two approaches in Gastwirth, Krieger and Rosenbaum
(1998) or Rosenbaum and Silber (2009b), the one parameter $\Gamma$ may be
unpacked into two sensitivity parameters, one controlling the relationship
between $u_{ij}$ and treatment $Z_{ij}$, the other controlling the relationship between
$u_{ij}$ and response under control $r_{Cij}$. For instance, an unobserved covariate
$u_{ij}$ that both doubles the odds of a short-HOB and doubles the odds
of readmission within two days is equivalent to $\Gamma = 1.25$, whereas doubling
the odds of a short-HOB with a four-fold increase in the odds of readmission
is equivalent to $\Gamma = 1.5$. See Gastwirth, Krieger and Rosenbaum (1998)
and Rosenbaum and Silber (2009b) for specifics, noting that the approaches
taken in these two papers differ in general but agree in the case of a binary
outcome $r_{Cij}$. See Gastwirth (1992) for related results for the method of
Cornfield et al. (1959).
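
The two numerical equivalences above can be reproduced with the amplification formula of Rosenbaum and Silber (2009b), which maps a pair of odds multipliers to a single sensitivity parameter. The formula is not printed in this section, so it is stated here as an assumption drawn from that paper; the function and argument names are hypothetical.

```python
def gamma_from_amplification(treatment_odds, outcome_odds):
    """Sensitivity parameter implied by an unobserved covariate that multiplies
    the odds of treatment by `treatment_odds` and the odds of the outcome by
    `outcome_odds` (amplification formula, assumed from Rosenbaum and Silber)."""
    return (treatment_odds * outcome_odds + 1.0) / (treatment_odds + outcome_odds)

# Doubling the odds of a short-HOB and doubling the odds of readmission.
gamma_double_double = gamma_from_amplification(2.0, 2.0)

# Doubling the odds of a short-HOB with a four-fold increase in readmission odds.
gamma_double_fourfold = gamma_from_amplification(2.0, 4.0)
```

Both values agree with the figures quoted in the text, 1.25 and 1.5 respectively.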

Under (8) or (9), sharp lower and upper bounds on the distribution of $T_C$
are obtained as a constant plus a binomial random variable with $n_{10}$ trials
and, respectively, probabilities $1/(1 + \Gamma)$ and $\Gamma/(1 + \Gamma)$, yielding an interval
of possible P-values for each $\Gamma \ge 1$; see Rosenbaum (2002a). Consider the
null hypothesis that being born at a short-HOB sometimes causes but never
prevents readmission within two days such that at least $A_0 = 500$
readmissions were caused. Testing the null hypothesis $A_0 \ge 500$, the upper bound
on the P-value is 0.040 for $\Gamma = 1.85$ and 0.110 for $\Gamma = 1.9$. Reversing roles
and testing the less plausible null hypothesis that a long-HOB causes but
does not prevent readmissions and caused at least 500 readmissions, the
upper bound on the P-value is 0.0192 for $\Gamma = 1.5$ and 0.079 for $\Gamma = 1.55$. In
brief, for it to be plausible that $A_0 = 500$ readmissions were caused or
prevented by short-versus-long-HOB, the unobserved covariate $u_{ij}$ would need
a $\Gamma > 1.5$. As mentioned in the previous paragraph, a $\Gamma = 1.5$ corresponds
with a $u_{ij}$ that doubles the odds of delivering at a long-HOB and increases
the odds of readmission by a factor of four.
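
The upper bounds on the P-value can be approximated with a short calculation. The sketch assumes that the bound is the lower binomial tail over the discordant pairs of the adjusted table, evaluated at success probability 1/(1 + Gamma), and that the adjusted counts are 561 of 1698 for the first hypothesis and, after reversing roles, 637 of 1698 (that is, 1108 - 471 and 1032 + 29 in the discordant cells). Those counts and the function names are this sketch's reading of the tables, not values printed in the text. The tail is summed in log space because the individual terms underflow otherwise.

```python
from math import lgamma, log, exp

def binom_lower_tail(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summing probabilities built in log space."""
    total = 0.0
    for j in range(k + 1):
        log_pmf = (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
                   + j * log(p) + (n - j) * log(1.0 - p))
        total += exp(log_pmf)
    return total

def p_value_upper_bound(k_treated_only, n_discordant, gamma):
    """Largest one-sided P-value allowed when treatment odds within a pair may
    differ by a factor of at most gamma (assumed form of the bound)."""
    return binom_lower_tail(k_treated_only, n_discordant, 1.0 / (1.0 + gamma))

# Short-HOB hypothesized to cause at least 500 readmissions.
bound_185 = p_value_upper_bound(561, 1698, 1.85)
bound_190 = p_value_upper_bound(561, 1698, 1.90)

# Roles reversed: long-HOB hypothesized to cause at least 500 readmissions.
bound_rev_150 = p_value_upper_bound(637, 1698, 1.50)
bound_rev_155 = p_value_upper_bound(637, 1698, 1.55)
```

Under these assumptions the four bounds should fall near the values reported above, crossing 0.05 between the two values of Gamma in each case.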

5. Summary: Flexible new tools for nonbipartite matching. When com- 
pared with network optimization [e.g., Derigs (1988)], the integer program- 
ming formulation in Section 3 substantially enlarges the set of tools available 
for nonbipartite matching to strengthen an instrumental variable. Among 
the new tools not previously available are the following: (i) fine or near-fine 
nonbipartite matching for one or more nominal variables (2), (ii) nonbipartite
matching with constraints on imbalances in means (4), (iii) optimal
subset nonbipartite matching using (6), and (iv) combining fine balance with
optimal subset nonbipartite matching. In the example, this approach formed
80,600 pairs of two babies who were similar on numerous covariates yet very
different in anticipated length of stay based on hour of birth.